data input

First, we load the data and the required packages, then standardize totalRooms and totalBedrooms by dividing each by households.
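As a minimal sketch of this standardization, the snippet below divides each total by the number of households. The toy values and the new column names (roomsPerHousehold, bedroomsPerHousehold) are assumptions for illustration and are not taken from the report.

```r
# Toy data frame: column names follow the report, values are made up.
housing <- data.frame(
  households    = c(100, 250, 400),
  totalRooms    = c(500, 1500, 1800),
  totalBedrooms = c(120, 300, 440)
)

# Per-household averages (hypothetical column names).
housing$roomsPerHousehold    <- housing$totalRooms / housing$households
housing$bedroomsPerHousehold <- housing$totalBedrooms / housing$households

print(housing)
```

Keeping the raw totals in the data frame alongside the standardized columns is what makes it necessary to subtract totalRooms and totalBedrooms from the model formula later on.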

We then split the dataset into two parts, a training set and a test set.
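One way such a split could be done in base R is sketched below. The 80/20 proportion, the seed, and the simulated data frame are assumptions, since the report does not state how the split was made.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in data with 100 rows (simulated, not the report's data).
housing <- data.frame(
  id               = 1:100,
  medianHouseValue = runif(100, 5e4, 5e5)
)

# Assumed 80/20 split; the proportion is not given in the report.
n         <- nrow(housing)
train_idx <- sample(n, size = floor(0.8 * n))

housing_train <- housing[train_idx, ]
housing_test  <- housing[-train_idx, ]

c(train = nrow(housing_train), test = nrow(housing_test))
```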

Tree model

CART model

We first fit a CART model that includes all variables except longitude and latitude. When growing the tree, a node is split only if it contains at least 30 observations and the split improves the overall fit by a factor of at least 0.00001.

The tree model uses the standardized versions of totalRooms and totalBedrooms in place of the raw totals: rpart(medianHouseValue ~ . - longitude - latitude - totalRooms - totalBedrooms)
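The two growing rules described above correspond to the minsplit and cp arguments of rpart.control. The sketch below fits the same formula on simulated data. The data frame sim, its values, and the per-household column names are assumptions, and the report's actual call may have differed in detail.

```r
library(rpart)

set.seed(1)
n <- 500

# Simulated stand-in for the housing data (values are made up).
sim <- data.frame(
  longitude        = runif(n, -124, -114),
  latitude         = runif(n, 32, 42),
  medianIncome     = runif(n, 1, 10),
  housingMedianAge = runif(n, 1, 50),
  households       = round(runif(n, 50, 1000))
)
sim$totalRooms           <- sim$households * runif(n, 3, 8)
sim$totalBedrooms        <- sim$households * runif(n, 0.8, 1.5)
sim$roomsPerHousehold    <- sim$totalRooms / sim$households
sim$bedroomsPerHousehold <- sim$totalBedrooms / sim$households
sim$medianHouseValue     <- 40000 * sim$medianIncome +
  500 * sim$housingMedianAge + rnorm(n, sd = 30000)

# minsplit = 30: only attempt a split in nodes with at least 30 observations.
# cp = 0.00001:  only keep splits that improve the fit by at least this factor.
tree_fit <- rpart(
  medianHouseValue ~ . - longitude - latitude - totalRooms - totalBedrooms,
  data    = sim,
  control = rpart.control(minsplit = 30, cp = 0.00001)
)

# First rows of the complexity table, including the cross-validated error.
head(tree_fit$cptable)
```

With a cp this small the tree is deliberately grown large, so pruning afterwards does the real model selection.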

The tree plot and the cross-validated error are:

We then use a function to pick the smallest tree and obtain its RMSE:
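The report does not say which rule the function uses to pick the smallest tree. The sketch below assumes the common one-standard-error rule, which picks the smallest tree whose cross-validated error is within one standard error of the minimum. The helper names prune_1se and rmse and the simulated data are hypothetical.

```r
library(rpart)

set.seed(1)
n  <- 400
x1 <- runif(n)
x2 <- runif(n)
sim   <- data.frame(x1 = x1, x2 = x2, y = 10 * (x1 > 0.5) + 5 * x2 + rnorm(n))
train <- sim[1:300, ]
test  <- sim[301:400, ]

# Grow a deliberately large tree.
big_tree <- rpart(y ~ x1 + x2, data = train,
                  control = rpart.control(minsplit = 30, cp = 0.00001))

# Hypothetical helper implementing the assumed one-standard-error rule.
prune_1se <- function(fit) {
  tab       <- fit$cptable
  best      <- which.min(tab[, "xerror"])
  threshold <- tab[best, "xerror"] + tab[best, "xstd"]
  # Rows of cptable run from the smallest tree to the largest,
  # so the first row under the threshold is the smallest such tree.
  pick <- min(which(tab[, "xerror"] <= threshold))
  prune(fit, cp = tab[pick, "CP"])
}

# Root mean squared error of a vector of predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

small_tree <- prune_1se(big_tree)
rmse(test$y, predict(small_tree, newdata = test))
```

Computing the RMSE on held-out rows, as here, keeps the figure comparable across the models that follow.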

## [1] 77166.22

random forests

Next, we fit a random forest using the same predictor variables and compute its RMSE:

## [1] 67555

gradient-boosted trees

We then fit a gradient-boosted trees model using the same predictor variables and compute its RMSE. Here we average over 100 bootstrap samples, this time using all candidate variables (mtry=6) in each bootstrapped sample.
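Averaging trees over bootstrap samples while considering every candidate variable at each split is bagging. The sketch below illustrates that averaging step with rpart on simulated data. It is not the report's code, and the variable names and data are assumptions. (mtry is the randomForest argument for the number of candidate variables per split.)

```r
library(rpart)

set.seed(1)
n   <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * sim$x1 + 5 * (sim$x2 > 0.5) + rnorm(n)
train <- sim[1:200, ]
test  <- sim[201:300, ]

B     <- 100  # number of bootstrap samples, as in the text
preds <- matrix(NA_real_, nrow = nrow(test), ncol = B)

for (b in seq_len(B)) {
  # Resample the training rows with replacement.
  boot_idx <- sample(nrow(train), replace = TRUE)
  # rpart considers all predictors at every split (no random subsetting).
  fit_b <- rpart(y ~ ., data = train[boot_idx, ],
                 control = rpart.control(minsplit = 30, cp = 0.00001))
  preds[, b] <- predict(fit_b, newdata = test)
}

# The bagged prediction is the average over the B trees.
bagged_pred <- rowMeans(preds)
rmse_bagged <- sqrt(mean((test$y - bagged_pred)^2))
rmse_single <- sqrt(mean((test$y - preds[, 1])^2))

c(single_tree = rmse_single, bagged = rmse_bagged)
```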

## [1] 68643.87

The random forest has the lowest RMSE of the three models (67555, versus 68643.87 for gradient-boosted trees and 77166.22 for CART), so we use the random forest model to predict medianHouseValue.

predict the result and make plots

a plot of the original data

The resulting plot is:

a plot of the model’s predictions of medianHouseValue

a plot of the model’s errors

Overall, our predictions tend to be higher than the actual values, and this overprediction is concentrated in the southern part of California.
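A regional pattern like this can be checked numerically as well as visually. The sketch below defines the error as predicted minus actual and summarizes it by region. The data are simulated, so the output says nothing about the report's results, and the latitude cutoff of 36 separating "south" from "north" is an arbitrary assumption.

```r
set.seed(1)
n <- 200

# Simulated locations, actual values, and predictions (not the report's data).
res <- data.frame(
  longitude = runif(n, -124, -114),
  latitude  = runif(n, 32.5, 42),
  actual    = runif(n, 1e5, 5e5)
)
res$predicted <- res$actual + rnorm(n, sd = 5e4)

# Positive error means the model predicted too high.
res$error <- res$predicted - res$actual

# Hypothetical north/south cutoff at latitude 36.
res$region <- ifelse(res$latitude < 36, "south", "north")

# Mean error and share of overpredictions in each region.
region_summary <- aggregate(error ~ region, data = res, FUN = mean)
share_over     <- tapply(res$error > 0, res$region, mean)

print(region_summary)
print(share_over)

# Map-style scatter: red for overprediction, blue for underprediction.
plot(res$longitude, res$latitude,
     col = ifelse(res$error > 0, "red", "blue"), pch = 16,
     xlab = "longitude", ylab = "latitude",
     main = "Sign of prediction error")
```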